ROCm and HIP: A Detailed 10-Chapter Tutorial: The Scientific Method of HIP Optimization

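Optimization in a HIP environment should be treated as a rigorous empirical discipline, not as a series of intuitive guesses. By adopting a systematic workflow, developers ensure that every code change is justified by data, moving performance engineering away from "optimization superstition" and toward a repeatable scientific cycle of hypothesis and verification.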

The Six-Step Workflow

The HIP performance guidelines recommend a systematic procedure:

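1. Measure a baseline: record the current execution time and throughput.
2. Profile the program: use rocprofv3 to collect hardware counters.
3. Identify the bottleneck: determine whether the kernel is compute-bound, memory-bound, or latency-bound.
4. Apply a targeted optimization: address only the bottleneck you identified.
5. Re-measure: confirm that the change actually improved performance.
6. Iterate: repeat the process until the performance goal is met.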
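Steps 1 and 5 (measure, then re-measure) can be sketched as a small host-side harness. This is a minimal illustration rather than a prescribed method: `median_ms` and `is_real_improvement` are hypothetical helper names, and the noise margin is an assumed parameter you would choose from your own run-to-run variance. Because kernel launches are asynchronous, the `work` callable for a real HIP kernel would need to include a synchronization such as `hipDeviceSynchronize()`, or you could time the kernel with HIP events (`hipEventRecord` / `hipEventElapsedTime`) instead of a host clock.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper (not part of HIP or ROCm): runs `work` `warmup` times
// untimed, then `repeats` times timed, and returns the median wall time in ms.
// For an even number of samples this returns the upper of the two middle values.
double median_ms(const std::function<void()>& work, int warmup, int repeats) {
    assert(repeats > 0);
    for (int i = 0; i < warmup; ++i) {
        work();  // warm-up runs absorb one-time costs such as lazy initialization
    }
    std::vector<double> samples;
    samples.reserve(repeats);
    for (int i = 0; i < repeats; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        work();
        auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Step 5: treat a change as an improvement only if it beats the baseline by
// more than an assumed noise margin (e.g. 0.05 for 5%).
bool is_real_improvement(double baseline_ms, double candidate_ms,
                         double noise_fraction) {
    return candidate_ms < baseline_ms * (1.0 - noise_fraction);
}
```

In use, you would call `median_ms` once on the unmodified code to record the baseline, apply a single change, call it again, and keep the change only if `is_real_improvement` returns true.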

Avoiding Optimization Superstition

Performance gains must be reproducible results that arise from how the code interacts with specific hardware. Avoid the following anti-patterns:

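Changing kernel code before measuring current performance.
Tuning block sizes without knowing whether the kernel is memory-bound.
Chasing occupancy numbers without evidence that occupancy matters for the specific workload.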
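One way to avoid the second anti-pattern is to classify the kernel before touching its launch configuration. The sketch below is a rough heuristic under stated assumptions, not an official ROCm method: the struct and function names are hypothetical, the operation and byte counts are assumed to come from profiler counters or manual counting, the device peaks are assumed to come from the vendor specification or a microbenchmark, and the 30% "low utilization" threshold is an arbitrary illustrative default.

```cpp
// All names below are hypothetical and exist only for this illustration.
struct KernelMeasurement {
    double flops;    // floating-point operations executed by the kernel
    double bytes;    // bytes moved to and from device memory
    double seconds;  // measured kernel time
};

struct DevicePeaks {
    double flops_per_s;  // peak compute throughput of the device
    double bytes_per_s;  // peak memory bandwidth of the device
};

enum class Bottleneck { Compute, Memory, Latency };

// Compares achieved throughput against each peak. If neither resource is
// well utilized, the kernel is treated as latency-bound; otherwise the more
// heavily utilized resource is reported as the bottleneck.
Bottleneck classify(const KernelMeasurement& k, const DevicePeaks& d,
                    double low_util = 0.3) {
    double compute_util = (k.flops / k.seconds) / d.flops_per_s;
    double memory_util = (k.bytes / k.seconds) / d.bytes_per_s;
    if (compute_util < low_util && memory_util < low_util) {
        return Bottleneck::Latency;
    }
    return compute_util > memory_util ? Bottleneck::Compute
                                      : Bottleneck::Memory;
}
```

If `classify` reports `Bottleneck::Memory`, block-size tuning aimed at keeping the compute units busy is unlikely to help, and the memory access pattern is the better target. The result is only as reliable as the counts and peaks fed into it, so treat it as a hypothesis to confirm by re-measuring.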

QUESTION 1

What is the very first step in the HIP optimization scientific method?

Identify the primary hardware bottleneck.

Measure a baseline performance metric.

Apply loop unrolling to kernels.

Tune thread block sizes for maximum occupancy.

QUESTION 2

Which of these is considered an 'Optimization Superstition'?

Using profiling tools to check memory bandwidth.

Applying optimizations before verifying the bottleneck.

Iterating the process after re-measuring.

Matching data precision to hardware capabilities.

QUESTION 3

Why is chasing high occupancy numbers without proof often counterproductive?

Higher occupancy always leads to higher latency.

Occupancy doesn't matter for AMD architectures.

It may force the compiler to spill registers, reducing performance despite more active threads.

It prevents kernels from using HBM2 memory.

QUESTION 4

If you replace `float` with `double` and performance drops significantly, what have you likely identified?

A compute-bound bottleneck on FP32 units.

A host-side synchronization error.

A failure in the ROCm compiler JIT.

That block size tuning is mandatory.

QUESTION 5

What is the recommended tool for Step 2 (Profile the program) in modern ROCm environments?

gdb

rocprofv3

htop

amd-config